[vllm] [cpu] [sagemaker] Add vLLM CPU inference image for SageMaker #5670

Open
timelfrink wants to merge 2 commits into aws:master from timelfrink:feature/vllm-cpu-sagemaker

timelfrink commented Feb 13, 2026

GitHub Issue #, if available: N/A

Description

Add a vLLM CPU-only inference image for SageMaker. This enables running vLLM on CPU instances for reranking, scoring, embeddings, and small generative models.

  • vllm/x86_64/cpu/Dockerfile.cpu — Multi-stage build from ubuntu:22.04, compiles vLLM v0.15.1 with VLLM_TARGET_DEVICE=cpu. Uses tcmalloc + Intel OpenMP, Python 3.12 via uv, reuses shared sagemaker_entrypoint.sh.
  • vllm/buildspec-cpu-sm.yml — Buildspec for CPU SageMaker target. Tag: 0.15.1-cpu-py312-ubuntu22.04-sagemaker.
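
For reviewers unfamiliar with the tag layout, the sketch below splits the tag from this PR into its hyphen-separated parts. The field names (`version`, `device`, `python`, `os_version`, `platform`) are my own reading of the tag string, not identifiers taken from the buildspec, and the five-field layout is an assumption that only holds for tags shaped like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageTag:
    """Fields of the image tag, inferred from the tag string in this PR (hypothetical names)."""

    version: str     # vLLM version, e.g. "0.15.1"
    device: str      # target device, "cpu" for this image
    python: str      # compact Python version, e.g. "py312"
    os_version: str  # base OS, e.g. "ubuntu22.04"
    platform: str    # deployment target, e.g. "sagemaker"

    def render(self) -> str:
        """Join the fields back into the hyphen-separated tag."""
        return "-".join(
            [self.version, self.device, self.python, self.os_version, self.platform]
        )


def parse_tag(tag: str) -> ImageTag:
    """Split a tag into its five fields; reject tags with a different shape."""
    parts = tag.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated fields, got {len(parts)}: {tag!r}")
    return ImageTag(*parts)


tag = parse_tag("0.15.1-cpu-py312-ubuntu22.04-sagemaker")
```

Here `tag.python` is `"py312"` and `tag.render()` reproduces the original string, which makes it easy to check that the buildspec tag and the Dockerfile versions agree.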

Manual testing on EC2 (c5.4xlarge):

  • Image builds successfully (~3.5 GB)
  • /health, /ping return 200
  • /v1/completions works (facebook/opt-125m)
  • /score and /invocations work with reranker (Alibaba-NLP/gte-multilingual-reranker-base)
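
The manual checks above could be scripted roughly as follows. This is a sketch under several assumptions rather than the procedure actually used: the base URL and port 8080 are guesses at how the container is exposed, the `/score` body uses the `text_1`/`text_2` fields as I understand vLLM's score API, and `/invocations` is assumed to accept the same body as `/score`. The helper names (`build_request`, `generative_checks`, `reranker_checks`, `run`) are hypothetical. The generative and reranker checks are kept separate because each needs the server started with a different model.

```python
import json
import urllib.request

# Assumption: the container is reachable locally on the SageMaker port.
BASE_URL = "http://localhost:8080"


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a GET request when payload is None, otherwise a JSON POST."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generative_checks(model, base_url=BASE_URL):
    """Health probes plus one OpenAI-style completion request."""
    completion = {"model": model, "prompt": "Hello, my name is", "max_tokens": 8}
    return [
        build_request("/health", base_url=base_url),
        build_request("/ping", base_url=base_url),
        build_request("/v1/completions", completion, base_url=base_url),
    ]


def reranker_checks(model, base_url=BASE_URL):
    """Score one query/document pair via /score and /invocations."""
    # Assumed body shape; adjust if the served vLLM version expects other fields.
    score = {
        "model": model,
        "text_1": "What is the capital of France?",
        "text_2": "Paris is the capital of France.",
    }
    return [
        build_request("/score", score, base_url=base_url),
        build_request("/invocations", score, base_url=base_url),
    ]


def run(requests, timeout=30.0):
    """Send each request; urlopen raises HTTPError on a non-2xx response."""
    results = []
    for req in requests:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            results.append((req.get_method(), req.full_url, resp.status))
    return results
```

With the container serving facebook/opt-125m, `run(generative_checks("facebook/opt-125m"))` would exercise the first three checks; with the reranker loaded, `run(reranker_checks("Alibaba-NLP/gte-multilingual-reranker-base"))` covers the last two. Nothing is sent until `run` is called.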

Tests Run

/buildspec vllm/buildspec-cpu-sm.yml
/tests sanity security

Formatting

  • N/A — No Python files in this PR (Dockerfile + YAML only)

PR Checklist

  • I've prepended the PR tag with the frameworks/job this applies to: [vllm] | [cpu] | [sagemaker]
  • This PR is fully backward compatible with pre-existing code
  • I've documented the DLC image/dockerfile this relates to
  • I've documented the tests I've run on the DLC image
  • I've reviewed the licenses of new binaries and dependencies

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

timelfrink and others added 2 commits February 13, 2026 08:57
Add support for vLLM CPU inference on SageMaker, aligned with official
vLLM CPU Dockerfile patterns.

Features:
- Multi-stage build: base → vllm-build → vllm-cpu → sagemaker
- Uses uv package manager for fast dependency installation
- Python 3.12 via uv (not limited to system python)
- Build caching with --mount=type=cache for apt, uv, ccache
- Wheel-based install (build wheel, then install separately)
- Uses official vLLM requirements files (cpu.txt, cpu-build.txt)
- Intel OpenMP + tcmalloc for x86_64 CPU performance
- gcc-12 as explicit compiler version

New files:
- vllm/x86_64/cpu/Dockerfile.cpu: Multi-stage Dockerfile
- vllm/buildspec-cpu-sm.yml: Build configuration for SageMaker

Expected image tag: vllm:0.15.1-cpu-py312-ubuntu22.04-sagemaker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Set image_size_baseline to 5000 (actual image ~3.5 GB)
- Add ulimit -c 0 to disable core dumps (matches upstream)
timelfrink force-pushed the feature/vllm-cpu-sagemaker branch from 7523dee to eb25cb6 on February 13, 2026 07:57